Method for Alignment of DNA Sequences with
Enhanced Accuracy and Read Length

Background of the Invention 

This application relates to DNA sequencing technology and in particular to a 
method for alignment of DNA sequences which provides enhanced accuracy and read-length. 

DNA sequencing is generally performed today using one of two methodologies: the chemical
degradation method or the chain termination method. Of these, the chain termination method
originally described by Sanger et al., Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977), or
variations thereof, has been adopted in many cases for development of automated sequencing
instruments and protocols.

In the chain termination sequencing method, fragments are generated using chain termination
reagents in a template-dependent polymerization reaction. The lengths of the fragments indicate
the positions of one species of base in a target polynucleotide. If fragment sets are generated
for each of the four species of bases (A, C, G and T), analysis of the fragment sizes permits the
explicit determination of the sequence of the target polynucleotide. While the translation of
this conceptual methodology into practice is effective for determination of sequences, the
application in automated systems has faced numerous challenges. These include the fact that the
band shape produced following electrophoresis of real fragments is not consistent from one band
to the next and may not be perfectly straight (smiling may occur); variations which can occur in
peak spacing from one lane of a gel to the next; variations in peak spacing which can occur as
the length of the run increases; and decreases in resolution which occur as the length of the run
increases. Furthermore, since much of the cost associated with DNA sequencing is in the set-up
time involved, for clinical and diagnostic applications the greater the length of DNA which can
be sequenced with accuracy, the smaller the per-patient cost can be. These considerations have
led to a variety of proposals for improving the chemistry used in sequencing, or for improving
the manner in which data representing the detected sequencing fragments is processed. The present
invention relates to the second type of improvement.

In order to obtain meaningful sequence information from raw data obtained by electrophoresis of
labeled sequencing fragments, one of the most important factors is the alignment of the data
traces representing each species of base. In non-automated systems, this is frequently done by
eye, and the eye of a skilled technician is in fact a remarkable tool for this purpose. Commonly
assigned US Patent No. 5,916,747, which is incorporated herein by reference, discloses a method
for aligning data traces from four channels of an automated electrophoresis detection apparatus
in which each channel detects the products of one of four chain-termination DNA sequencing
reactions such that the four channels together provide information concerning the sequence of all
four bases within a nucleic acid polymer being analyzed. The method places the four data traces
in a trial alignment, and then determines coefficients of shift and stretch for selected data
points within each normalized data trace to optimize a cost function which reflects the extent of
overlap of peaks in the combined normalized data traces to which the coefficients have been
applied. Warp functions are then generated for the normalized data traces from the coefficients
of shift and stretch determined for the selected data points, and applied to the respective data
traces to produce four warped data traces which are assembled to form an aligned data set. This
data set is then used for base-calling to complete the sequence determination process.

The procedure of the '747 patent is generally suited for the determination of sequences where
explicit data for the positions of all four bases are obtained. On the other hand, it is not
always necessary to determine the positions of all four species of bases in order to obtain
diagnostic information from a given polynucleotide. (See commonly assigned US Patent
No. 5,834,189, which is incorporated herein by reference.) Commonly assigned US Patent
No. 5,853,979, which is incorporated herein by reference, discloses a method for the
interpretation of experimental fragment patterns for polynucleotides having putatively known
sequences. In this method, at least one raw fragment pattern representing the positions of a
selected nucleotide base as a function of migration time or distance is obtained for the
experimental sample. The fragment pattern is evaluated to determine one or more "normalization
coefficients." These normalization coefficients reflect the displacement, stretching or
shrinking, and rate of stretching or shrinking of the clean fragment pattern, or segments
thereof, which are necessary to obtain a suitably high degree of correlation between the clean
fragment pattern and a standard fragment pattern which represents the positions of the selected
nucleic acid base within a standard polymer actually having the known sequence as a function of
migration time or distance. The normalization coefficients are then applied to the fragment
pattern to produce a normalized fragment pattern which is used for base-calling in a conventional
manner. As indicated, however, this technique requires prior knowledge of the expected fragment
pattern for the polynucleotide being analyzed.

Notwithstanding such techniques, there remains room for improvement in the manner in which
automated analysis of sequencing fragment patterns is carried out. In particular, there remains a
need for systems which allow enhanced read-length, i.e., the analysis of a greater number of
bases in a single lane of a gel, without loss of accuracy or substantial increase in analysis
time. It is an object of the present invention to provide a method which answers this need.

Summary of the Invention

The present invention provides a method for aligning sequence data traces. In accordance with the
invention, an experimental data trace representing the positions of a first species of base
within a target polynucleotide and a reference data trace representing the positions of a second
species of base (which may be the same as or different from the first species) within a reference
polynucleotide are obtained by separating appropriate sequencing fragments generated from the
target and reference polynucleotides in a common lane of an electrophoresis gel. For each
reference data trace, a plurality of peaks corresponding to fragments having a size in the range
of 40 to 1200 bases are selected. A base number is assigned to each of the selected peaks in the
reference data trace, and a numerical "peak file" is created with information about the peak
number and migration time (or distance). This peak file is analyzed to determine a set of
polynomial coefficients which will allow substantial linearization of a plot of peak number
versus separation between adjacent peaks and alignment of the traces with respect to each other.
These coefficients are used to create a corrected time scale identifying where peaks should be
located on a given experimental gel. This corrected time scale is used to guide the sampling of
the experimental data, and for assignment of peaks within the data.

VGEN.P-056-US
PATENT APPLICATION

Brief Description of the Drawings

Fig. 1 shows a plot of peak spacing versus peak number for unaligned data, and data aligned with
third and fifth order polynomials;

Fig. 2 shows a plot of peak spacing versus peak number for data aligned with third, fourth and
fifth order polynomials;

Figs. 3A and B show plots of the difference, for each lane, between the run time of a base
(322nd nt) and its average value for all 16 lanes of a gel. Fig. 3A corresponds to the run time
difference in the raw data; Fig. 3B is the run time difference after alignment;

Fig. 4 shows the relationship between accuracy and read length for a first set of experimental
data which was well-aligned on the gel;

Fig. 5 shows the relationship between accuracy and read length for a first set of experimental
data which was poorly-aligned on the gel; and

Fig. 6 shows a system in accordance with the invention.

Detailed Description of the Invention 

The present invention provides a method for linearization and alignment of sequence data traces.
As used herein, the term "linearize" refers to establishing equal spacing in a time domain
between adjacent peaks within the overall sequence in an experimental data trace. The term
"align" refers to establishing the correct positions within the overall sequence for the peaks in
an experimental data trace. When a data trace is obtained for each of the four bases, the
alignment process results in an explicit determination of the position of each and every base.
However, since in some instances it is not necessary to perform all four sequencing reactions and
analyze the results to obtain useful diagnostic data, "alignment" can be performed on a single
trace, representing the positions of a single species of nucleotide base within a target
polynucleotide. In this case, the single trace after linearization is "aligned" with a standard
time scale, to show the base numbers associated with peaks within the linearized trace. Alignment
can also be performed on data sets of two or more traces representing the positions of two or
more species of nucleotide bases within the target polynucleotide.

The process of linearization and alignment is essentially one of assigning a correct numerical
position to each of the bases. An important aspect of the linearization and alignment process is
compensation for variation in peak spacing which occurs over time even within a single lane of an
electrophoresis gel. The present invention performs this compensation by co-electrophoresing a
reference sequence with the experimental sequence and utilizing the resulting reference data
trace to define the correct peak spacing.

The specification and claims of this application use the term "DNA sequencing fragments" to
describe the mixture of polynucleotides which results when chain extension polymerization is
performed in the presence of a chain-terminating base analog, such as a dideoxynucleotide
triphosphate. The term "DNA sequencing fragments" only requires the presence in the mixture of
fragments the lengths of which are indicative of the positions of one type of base within the
polynucleotide being analyzed.

In the simplest embodiment of the invention, experimental and reference data traces obtained from
a single lane of an electrophoresis gel are evaluated. The experimental sample may be, for
example, the A-sequencing fragments generated from a target polynucleotide of interest. The
reference sample is, for example, the T-sequencing fragments generated from a reference
polynucleotide of known sequence. Preferably, the reference polynucleotide is of similar total
length to the experimental polynucleotide so that the reference data extends over the entire
length of the experimental sequence information.

Because the reference polynucleotide has a known sequence, it is possible to immediately create a
peak table having two columns: actual retention time and peak number. Thus, for example, if the
sequence were 9 bases long, and had the sequence ACATTACGA, then the data trace derived from the
A-sequencing fragments would have four peaks appearing at times T1, T2, T3 and T4, respectively.
The peak table would therefore appear as follows:

    Retention time    Peak number
    T1                1
    T2                3
    T3                6
    T4                9

If the spacing of the peaks in the gel over this region were exactly the same, then a plot of T
versus peak number would produce a straight line, and a plot of the spacing (the difference
between each adjacent peak) versus peak number would produce a straight, horizontal line. Because
experimental data does not meet this ideal, however, the result is in fact far different. Thus,
as shown in Fig. 1, the experimental spacing between adjacent peaks as a function of base number
may follow a complex curve, at first increasing through a maximum, and then decreasing again.
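The construction of a peak table for a reference of known sequence can be sketched in a few lines
of Python. This is a minimal illustration and not the software used in the examples: the function
names and the retention times are hypothetical, and dividing each spacing by the number of bases
between neighbouring peaks is an assumption about how peaks that are not at consecutive base
positions would be compared.

```python
def peak_numbers(sequence, base):
    """Return the 1-based positions at which `base` occurs in `sequence`."""
    return [i for i, b in enumerate(sequence.upper(), start=1) if b == base.upper()]


def build_peak_table(sequence, base, retention_times):
    """Pair each measured retention time with the base number of its peak."""
    numbers = peak_numbers(sequence, base)
    if len(numbers) != len(retention_times):
        raise ValueError("one retention time is needed per expected peak")
    return list(zip(retention_times, numbers))


def per_base_spacing(peak_table):
    """Spacing between neighbouring peaks, divided by the bases separating them."""
    spacings = []
    for (t0, n0), (t1, n1) in zip(peak_table, peak_table[1:]):
        spacings.append((n1, (t1 - t0) / (n1 - n0)))
    return spacings


# Hypothetical retention times (seconds) for the four A peaks of ACATTACGA.
table = build_peak_table("ACATTACGA", "A", [100.0, 120.0, 150.0, 180.0])
print(table)                    # retention times paired with peak numbers 1, 3, 6, 9
print(per_base_spacing(table))  # the same spacing per base throughout for these ideal values
```

With real data the per-base spacing would not be constant, which is the behavior the curve fitting
is intended to correct.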

In accordance with the present invention, a curve fitting procedure is applied to the raw
reference data trace in which the data is fit to a polynomial, generally a third or higher-order
polynomial. Although this fitting process is generally performed in actual practice using a
computer program and any of various known curve fitting programs, the procedure employed can be
understood from the discussion below. In the unaligned data, one is essentially plotting the
function

$$\Delta T = mP + c$$

where $\Delta T$ is the spacing between adjacent peaks (in units of time), $m$ is the slope of
the line, $c$ is a constant which is characteristic of the gel and which reflects the
characteristic peak spacing, and $P$ is peak number. In ideal data, the slope $m$ is 0, such that
there is no actual relationship between $\Delta T$ and $P$, and $\Delta T$ is simply a constant.
As one can see in Fig. 1, the experimental data are far from being a straight line. In this case,
the experimental curve can be approximated by a polynomial. The empirical curve $\Delta T$ is fit
to a polynomial function by a least squares method:

$$\Delta T = a_{i0} + a_{i1} P + a_{i2} P^2 + \dots + a_{ik} P^k .$$

The degree ($k$) of the polynomial is an input parameter of the fitting program. The procedure
generates a set of coefficients $\{a_{ik}\}$ for each gel lane ($i$). A curve fitting program
identifies the coefficients $a_{ik}$ (which may be positive or negative) and the constant which
bring the resulting plot of the reference data closest to a straight line. Based on the set of
polynomial coefficients $\{a_{ik}\}$, a corrected time scale is defined for each peak in gel lane
$i$, according to the formula

$$T_{ip} = C_i \left[ a_{i0} + a_{i1} t_{ip} P + a_{i2} t_{ip} P^2 + \dots + a_{ik} t_{ip} P^k \right],$$

where $T_{ip}$ is the corrected time value for the reference peak of length $p$, $t_{ip}$ is the
experimentally measured run time of this peak, and $C_i$ is a scaling factor. This transformation
causes the spacing between consecutive peaks in the corrected time domain ($dT_{ip}/dp$) to
remain constant over the course of the run. The transformation (linearization) is performed for
both the reference peaks and the sample peaks in each gel lane ($i$).
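The least-squares fit of peak spacing against peak number can be sketched as follows. This is an
illustrative, standard-library-only implementation that solves the normal equations by Gaussian
elimination; the names `fit_polynomial` and `evaluate`, the rescaling of peak numbers, and the
synthetic spacing data are assumptions made for the example and are not the fitting program
referred to above. Any established curve-fitting routine could be substituted.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a0, a1, ..., ak] of y = sum(a_j * x**j)."""
    n = degree + 1
    if len(xs) < n:
        raise ValueError("need at least degree + 1 points")
    # Normal equations: (V^T V) a = V^T y, with V the Vandermonde matrix.
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - s) / a[r][r]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Synthetic spacing data that rises through a maximum and then falls.
# Peak numbers are rescaled to roughly [0, 1] to keep the normal equations
# well conditioned; this rescaling is a numerical convenience, not part of the method.
peaks = list(range(40, 1041, 50))
x = [p / 1000.0 for p in peaks]
spacing = [10.0 + 30.0 * v - 25.0 * v * v for v in x]
coeffs = fit_polynomial(x, spacing, degree=3)
residual = max(abs(evaluate(coeffs, v) - s) for v, s in zip(x, spacing))
print(coeffs, residual)
```

The degree is passed in as a parameter, mirroring the statement that the degree of the polynomial
is an input to the fitting program.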

Each gel lane has a different scaling factor, $C_i$. For any particular gel, the set of values
$\{C_i\}$ is chosen to equalize the spacing between consecutive peaks in the corrected time
domain, ($dT_{ip}/dp$), across all lanes of the gel. A gel lane is uniformly compressed by
setting $C_i < 1$, and it is uniformly stretched by setting $C_i > 1$. A set of coefficients
$\{C_i\}$ is therefore defined, such that all lanes of the gel have the same total run time in
the corrected time domain. In the dimension of real time, the data points are evenly spaced.
However, in the dimension of "corrected time", the corresponding time intervals are not of equal
lengths. Therefore the experimental data sets are "resampled" into equally-spaced values (in the
corrected time domain) by quadratic interpolation. Resampling of the data set for each gel lane
is done separately, because the corrected time scale may be different for each lane. The
procedure described above is a global alignment, which precedes any subsequent local alignment by
the base calling software. This global alignment procedure is general, and should be compatible
with all types of local alignment algorithms.
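The scaling and resampling steps can be illustrated with a short sketch. The code below is an
example under stated assumptions rather than the implementation used in the examples:
`lane_scaling_factors` chooses a factor per lane so that every lane ends at the same corrected run
time, and `resample_quadratic` uses three-point Lagrange interpolation, which is one common way
to carry out quadratic interpolation. The function names and sample values are hypothetical.

```python
from bisect import bisect_left


def lane_scaling_factors(corrected_run_times, target=None):
    """Return one factor per lane so that factor * run_time equals `target`."""
    if target is None:
        target = sum(corrected_run_times) / len(corrected_run_times)
    return [target / t for t in corrected_run_times]


def resample_quadratic(corrected_times, values, step):
    """Resample (corrected_times, values) onto a grid with uniform `step`.

    `corrected_times` must be strictly increasing and hold at least 3 points.
    """
    n = len(corrected_times)
    if n < 3 or n != len(values):
        raise ValueError("need matching sequences of at least three samples")
    grid, out = [], []
    count = int((corrected_times[-1] - corrected_times[0]) / step + 1e-9) + 1
    for j in range(count):
        t = corrected_times[0] + j * step
        # Choose three neighbouring samples around t, clamped at the ends.
        k = min(max(bisect_left(corrected_times, t), 1), n - 2)
        x0, x1, x2 = corrected_times[k - 1:k + 2]
        y0, y1, y2 = values[k - 1:k + 2]
        # Three-point Lagrange (quadratic) interpolation.
        y = (y0 * (t - x1) * (t - x2) / ((x0 - x1) * (x0 - x2))
             + y1 * (t - x0) * (t - x2) / ((x1 - x0) * (x1 - x2))
             + y2 * (t - x0) * (t - x1) / ((x2 - x0) * (x2 - x1)))
        grid.append(t)
        out.append(y)
    return grid, out


# Samples evenly spaced in real time land unevenly in corrected time.
corrected = [0.0, 0.8, 1.7, 2.7, 3.8, 5.0]
signal = [t * t for t in corrected]  # a quadratic signal is reproduced exactly
grid, resampled = resample_quadratic(corrected, signal, step=1.0)
print(grid)
print(resampled)
print(lane_scaling_factors([950.0, 1000.0, 1050.0]))
```

In this sketch a lane with a shorter corrected run time receives a factor above 1 (stretched) and
a lane with a longer run time receives a factor below 1 (compressed).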

The basic methodology described above for alignment of a single data trace can also be applied in
other embodiments. For example, data can be obtained for all four bases (A, C, G and T) in four
lanes, to obtain explicit position information for the complete sequence of a target
polynucleotide. In this case, a set of reference sequencing fragments is desirably run in each of
the four lanes. Further, in multi-lane gels, it is desirable to run a set of reference sequencing
fragments in each lane, regardless of the nature of the experimental samples. If a sequencing
apparatus is used that is capable of distinguishing between more than two labels, multiple
experimental sets of sequencing fragments may be run in one lane along with a set of reference
sequencing fragments. In each case where more than one reference data trace is obtained from a
gel, the spacings of all the reference data traces can be combined to produce a single set of
coefficients and single characteristic spacing which is applied to all of the experimental data
traces from the gel.

Several features are common to all of the various embodiments discussed above. Each set of the
experimental sequencing fragments and the reference sequencing fragments is labeled with a
distinguishable label, i.e., the labels on the experimental fragments and reference fragments are
different from one another when they are present in the same lane of the gel. The nature of the
labels is a matter of choice and compatibility with the detection system employed. Suitable
labels include radiolabels, chromophores, chromogenic labels and fluorogenic labels. Preferred
labels, however, are fluorescent labels compatible with automated multi-dye sequencers. Specific
examples of suitable fluorescent labels include cyanine dyes such as Cy5.0 and Cy5.5 (see US
Patents Nos. 4,981,977 and 5,268,486), energy transfer dyes (US Patent No. 5,800,996) and
rhodamine dyes (US Patents Nos. 5,366,860 and 4,855,225).

There is no required relationship between the target polynucleotide and the reference
polynucleotide, and it is not mandatory that the same set of reference sequencing fragments be
used in all of the lanes of a gel. This is the case because the alignment depends on the measured
position of the known bases of the reference trace, but not on the identity of the bases.
However, the reference polynucleotide should be selected to provide enough peaks (or bands) to
facilitate the use of a desirable degree of polynomial for fitting the experimental data. For
example, if one wishes to use a 5th-degree polynomial, the reference polynucleotide must provide
at least 6 peaks, since a polynomial of degree k has k + 1 coefficients to be determined.

Furthermore, while it is necessary to know the sequence of the reference polynucleotide for the
creation of the initial peak table, it is not necessary to have any a priori knowledge of the
sequence of the target polynucleotide. Thus, while the present invention is particularly
applicable to diagnostic applications where the putative sequence of the target polynucleotide is
known, it is not limited to such applications.

A further factor which can be adjusted by the user is the number of peaks within the reference
data trace that are used in determining the polynomial coefficients and characteristic spacing.
While all of the peaks can be considered, this increases the processing time and burden. As a
practical matter, a much smaller number of peaks can be utilized and still provide good alignment
of the experimental data traces. For example, for alignment of sequencing fragments spanning 40
to 1,200 bases, from 3 to 40 peaks in the reference data trace are suitably selected. The
selected peaks are preferably distributed fairly evenly throughout the reference data trace,
although precisely equal distribution is not required.
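One simple way to choose a small, fairly evenly distributed subset of reference peaks is sketched
below. The function `select_peaks` and the list of reference peak positions are hypothetical and
serve only to illustrate the idea of even distribution; the specification does not prescribe a
particular selection rule.

```python
def select_peaks(peak_numbers, count):
    """Pick `count` peaks spread roughly evenly across the sorted peak list."""
    peaks = sorted(peak_numbers)
    if count < 2 or count > len(peaks):
        raise ValueError("count must be between 2 and the number of peaks")
    last = len(peaks) - 1
    # Evenly spaced indices from the first to the last peak, rounded to integers.
    indices = [round(j * last / (count - 1)) for j in range(count)]
    return [peaks[i] for i in indices]


# Hypothetical reference peaks: one peak at every 7th base between 40 and 1200.
reference_peaks = list(range(40, 1201, 7))
print(select_peaks(reference_peaks, 3))
print(select_peaks(reference_peaks, 10))
```

Because the indices are only rounded to the nearest available peak, the selection is approximately
rather than precisely even, which is consistent with the tolerance described above.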

Fig. 6 shows a schematic representation of an apparatus in accordance with the present invention
for evaluating the sequence of a target polynucleotide. The apparatus as shown comprises a
processor housing 10 which has an input 11 for receiving information about one or more
experimental DNA sequencing data traces derived from the separation of experimental DNA
sequencing fragments reflecting the position of at least one base in the target polynucleotide
and one or more reference DNA sequencing data traces derived from the separation of reference DNA
sequencing fragments reflecting the position of at least one base in a reference polynucleotide
of known sequence. For example, input 11 may be in the form of a wire for transmitting
sequence-related data from a sequencer. Data could also be transmitted via a wireless link, or
communicated to the apparatus through disk drive 13.

Within the housing 10 is a data processing apparatus 14 which includes one or several processors.
The processor or processors are operatively programmed

(a) to evaluate the reference DNA sequencing data traces to determine a corrected time scale
indicative of migration times at which peaks should occur;

(b) to sample the experimental DNA sequencing data traces at time points determined by the
corrected time scale; and

(c) to assign a base number to each peak found in the experimental DNA sequencing data traces
based upon the corrected time scale, thereby obtaining information about the sequence of the
target polynucleotide.

The assigned base numbers may be further processed to provide an output indicative of information
about the sequence of the target polynucleotide and this information is communicated to the user
via an output device. Exemplary output devices are a display 15 or printer 16. The information
may also be communicated by saving it to the disk drive 13 (which can function as either an input
or an output device) or through a telecommunication connection (such as a modem or internet
connection).

In an embodiment of the invention, the processor programmed to evaluate the reference DNA
sequencing data traces is programmed to perform the steps of:

(i) identifying a plurality of peaks in the reference DNA sequencing data traces, and creating a
data table containing the number of each peak based on the known sequence of the polynucleotide,
and the position of each peak in the reference DNA sequencing data trace;

(ii) identifying a set of coefficients for a polynomial effective to substantially linearize a
plot of peak number versus separation between adjacent peaks; and

(iii) creating from the coefficients and the polynomial a corrected time scale which reflects the
positions at which a peak should occur at any given point in a sequencing data trace.
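Once a corrected time scale exists, the assignment of base numbers to experimental peaks can be
sketched as below. The sketch assumes that, after linearization, peaks fall on a uniform grid
defined by the corrected time of one known base and a constant spacing per base; the names
`assign_base_numbers`, `origin_time`, `origin_base` and `spacing`, and all numerical values, are
hypothetical and are not taken from the specification.

```python
def assign_base_numbers(peak_times, origin_time, origin_base, spacing):
    """Map corrected peak times to base numbers on a uniform grid.

    Returns (base_number, offset) pairs, where offset is the distance of the
    peak from its ideal grid position, in units of corrected time.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    assigned = []
    for t in peak_times:
        steps = round((t - origin_time) / spacing)
        base = origin_base + steps
        offset = t - (origin_time + steps * spacing)
        assigned.append((base, offset))
    return assigned


# Hypothetical corrected times (seconds) of experimental peaks; base 40 is
# taken to sit at 600 s and consecutive bases are taken to be 16.5 s apart.
peaks = [600.4, 633.1, 649.2, 698.9]
for base, offset in assign_base_numbers(peaks, 600.0, 40, 16.5):
    print(base, round(offset, 2))
```

The offset returned with each base number gives a rough indication of how far a peak lies from
where the corrected time scale says it should occur.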

The invention will now be further described and illustrated with reference to the following,
non-limiting examples.

Example 1 

Lanes 1, 5, 9 and 13 of a standard 16 lane MICROCEL™ electrophoresis gel (Visible Genetics Inc.)
were loaded with a mixture of the A-terminated sequencing fragments from M13 labeled with CY5.0
fluorescent cyanine dye label as the experimental sample, and T-terminated sequencing fragments
from M13 labeled with CY5.5 fluorescent cyanine dye label as the reference sequence. Lanes 2, 6,
10 and 14 were loaded with a mixture of C-terminated sequencing fragments from M13 labeled with
CY5.0 fluorescent cyanine dye label as the experimental sample and CY5.5-labeled M13 T's as the
reference sequence fragments. Lanes 3, 7, 11 and 15 were loaded with a mixture of G-terminated
sequencing fragments from M13 labeled with CY5.0 fluorescent cyanine dye label as the
experimental sample and CY5.5-labeled M13 T's as the reference sequence. Lanes 4, 8, 12 and 16
were loaded with a mixture of T-terminated sequencing fragments from M13 labeled with CY5.0
fluorescent cyanine dye label as the experimental sample and CY5.5-labeled M13 T's as the
reference sample. The reference sequence and the experimental sequence in this example are
derived from the same source, and indeed in the case of the T-terminated sequencing fragments are
identical to the reference sequence except for the difference in label. However, the good results
for alignment and linearization indicate that the reference sequence does not have to be related
to the experimental sequence in any way.
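The loading pattern described above repeats every four lanes, which can be written compactly. The
function `lane_layout` below is a hypothetical helper used only to restate that pattern; it is
not part of the sequencer software.

```python
def lane_layout(lanes=16, bases="ACGT", reference="T"):
    """Map each lane number to (experimental base, reference base)."""
    return {lane: (bases[(lane - 1) % len(bases)], reference)
            for lane in range(1, lanes + 1)}


layout = lane_layout()
a_lanes = [lane for lane, (base, _) in layout.items() if base == "A"]
print(a_lanes)      # lanes carrying A-terminated experimental fragments
print(layout[16])   # experimental and reference bases for lane 16
```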

The labeled DNA molecules were separated by electrophoresis and detected in real time using a
638 nm laser excitation source. The data collection was performed on a 2-color DNA Sequencer
(Visible Genetics Inc.), to record two channels for each physical lane, one channel reflecting
detection of the CY5.0 label affixed to the experimental sequencing fragments and one channel
reflecting detection of the CY5.5 label affixed to the reference fragments. Collected data from
the two channels were corrected for overlap in the emission spectra of the two labels and the two
resulting data traces were saved as a "data file." Data analysis on the data file was performed
using special software in accordance with the protocols of the present invention.
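The specification does not state how the correction for spectral overlap was carried out. One
common approach, shown here purely as an assumed illustration, is linear unmixing with a 2 x 2
crosstalk model. The function name, the leakage fractions and the signal values are all
hypothetical.

```python
def unmix_two_channels(observed_a, observed_b, leak_a_into_b, leak_b_into_a):
    """Undo linear crosstalk between two detection channels.

    Assumed model: observed_a = true_a + leak_b_into_a * true_b
                   observed_b = true_b + leak_a_into_b * true_a
    """
    det = 1.0 - leak_a_into_b * leak_b_into_a
    if abs(det) < 1e-12:
        raise ValueError("crosstalk matrix is singular")
    pairs = list(zip(observed_a, observed_b))
    true_a = [(a - leak_b_into_a * b) / det for a, b in pairs]
    true_b = [(b - leak_a_into_b * a) / det for a, b in pairs]
    return true_a, true_b


# Hypothetical signals: 20% of channel A leaks into B, 10% of B leaks into A.
true_a, true_b = [100.0, 0.0, 50.0], [0.0, 80.0, 40.0]
obs_a = [a + 0.1 * b for a, b in zip(true_a, true_b)]
obs_b = [b + 0.2 * a for a, b in zip(true_a, true_b)]
rec_a, rec_b = unmix_two_channels(obs_a, obs_b, leak_a_into_b=0.2, leak_b_into_a=0.1)
print(rec_a, rec_b)
```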

For each reference channel, several peaks (from 3 to 40 in different experiments) were identified
having sizes in the range from 40 to about 1200 bases. The base number was assigned to each of
these peaks based on knowledge of the sequence of the reference sample, and the position of each
peak in the time scale of the experiment was determined. The information about these peaks in the
form of a base number and a peak position (or time) was stored in a "peak file."

To align the raw data stored in the data file, the peak data was used to calculate the standard
number of bases per unit time as an average over the 16 reference channels. The data was fit to
3rd and 5th order polynomials expressing the relationship between base number and peak position.
Using the fitted polynomial, a corrected time scale was created, so that the reference peaks are
equally spaced in the corrected time and have the same origin. The number of bases per unit
corrected time is constant for all the data in the run. However, the actual time interval between
peaks is not generally constant. Thus, the corrected time scale is used to resample the
experimental data trace and the associated reference channel. This procedure essentially involves
looking at the experimental data at the times specified by the corrected time scale, and
determining whether or not a peak is present at the correct time.
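The averaging of bases per unit time over the reference channels can be sketched as follows. How
the per-lane rate was actually computed is not stated; taking it from the first and last peak of
each peak file is an assumption made here for illustration, and the function names and peak files
are hypothetical.

```python
def bases_per_unit_time(peak_file):
    """Average migration rate for one lane from its first and last peak.

    `peak_file` is a list of (base_number, time) pairs sorted by time.
    """
    (first_base, first_time), (last_base, last_time) = peak_file[0], peak_file[-1]
    return (last_base - first_base) / (last_time - first_time)


def standard_rate(peak_files):
    """Mean of the per-lane rates across all reference channels."""
    rates = [bases_per_unit_time(pf) for pf in peak_files]
    return sum(rates) / len(rates)


# Hypothetical peak files for three lanes (base number, time in seconds).
peak_files = [
    [(40, 600.0), (1040, 10600.0)],
    [(40, 650.0), (1040, 13150.0)],
    [(40, 500.0), (1040, 8500.0)],
]
print(standard_rate(peak_files))
```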

Figs. 1 and 2 illustrate the application of the invention to the specific sequences described
above. Fig. 1 shows the spacing between adjacent bases as a function of base number, for
non-aligned (raw) data (closed diamonds), and data aligned and linearized using 3rd order (open
triangles) and 5th order (open circles) polynomials. It is clearly seen that the spacing changes
significantly during the run, but is linearized by fitting with either the 3rd or 5th order
polynomial.

Fig. 2 illustrates the influence of the order of the polynomial used for fitting the raw data of
the experimental traces. Increasing the polynomial from 3rd order (open triangles) to 4th order
(closed diamonds) improves linearity noticeably, although the curve still has a nonlinear part in
the beginning of the run (up to about 100 bases). The 5th order polynomial (open circles) gives
the best result, with the maximum deviation from the straight line being less than about 0.5
seconds up to 1300 bases. Such linearity is close to the limit in this particular experiment,
because the sampling time was 0.5 seconds. Thus, further increase in the order of the polynomial
would only increase computational time, without being likely to provide any significant
improvement in linearity.

Figs. 3A and B illustrate improvement in the alignment of the sequencing data (from trace to
trace) based on the procedure of the invention. Fig. 3A shows raw data. The difference in run
time can reach 500 seconds. Alignment of the raw data, even with a 3rd-degree polynomial,
improves the data significantly, reducing the difference in run time to a maximum of ~90 seconds.
(See Fig. 3B.) When a 5th-degree polynomial is used, the difference becomes less than 10 seconds.
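The quantity plotted in Figs. 3A and 3B, the difference between the run time of a given base in
each lane and its average over all lanes, is straightforward to compute. The sketch below uses
hypothetical run times for four lanes; the function names are illustrative only.

```python
def run_time_deviations(run_times):
    """Difference between each lane's run time and the mean over all lanes."""
    mean = sum(run_times) / len(run_times)
    return [t - mean for t in run_times]


def max_spread(run_times):
    """Largest absolute deviation from the mean across lanes."""
    return max(abs(d) for d in run_time_deviations(run_times))


# Hypothetical run times (seconds) of one base in four lanes.
raw = [5200.0, 5350.0, 5600.0, 5450.0]
aligned = [5398.0, 5402.0, 5401.0, 5399.0]
print(run_time_deviations(raw), max_spread(raw))
print(run_time_deviations(aligned), max_spread(aligned))
```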

Example 2

Raw data traces were generated using M13 T-terminated sequencing fragments in four adjacent lanes
of a sequencing gel. As noted in Table 1, the raw, unaligned data traces showed the substantial
variability in peak position that can be observed. Application of a 5th order polynomial to this
data to determine a corrected time scale, and the application of this time scale to the raw data
traces, resulted in a substantial improvement in the alignment of the data. This improved
alignment allows the calling of bases with greater accuracy over the entire 1300 base length.
Table 1

    Peak Number    Separation between high and low    Separation between high and low
                   time peaks, before alignment       time peaks, after alignment
    40             1 min 16 sec                       14 sec
    140            2 min 5 sec                        4 sec
    312            9 min 30 sec                       6 sec
    607            41 min 38 sec                      4 sec
    970            almost 1.5 hours                   24 sec


Example 3

To understand the significance of the number of peaks incorporated in the peak file for use in
generating the polynomial, the data from a T-terminated M13 fragment set was processed using 3,
5, 10, 20 and 40 selected peaks, and the spacing between adjacent peaks at various base positions
after alignment was determined. The results are shown in Table 2. As can be seen, higher numbers
of peaks reduce the extent of variation in peak spacing, although even as few as 3 peaks provide
useful results. Comparison of the results from 10, 20 and 40 peaks suggests that an increase
beyond 40 would only add to the computational burden without improving the quality of the result.

Example 4 

To evaluate the ability of the linearization and alignment processes of the invention to provide
a demonstrable improvement in base calling accuracy and read length, M13 sequence was used.
CY5.0-labeled A, C, G and T-terminated sequence fragments were used as experimental samples,
while M13 T's labeled with CY5.5 were used as the reference sample. Base-calling was performed on
the raw data, and on the data after alignment based on 40 peaks of the reference trace.

Table 2

Spacing between adjacent peaks (entries marked "?" are illegible in the source)

    Base position    3pk     6pk     10pk    20pk    40pk
    40               10.4    14.6    15.2    14.1    15.7
    62               11.1    15.3    15.8    14.6    16.1
    95               12.4    16.4    16.7    15.6    17.5
    117              13.0    16.?    16.?    15.8    17.4
    140              13.7    16.7    16.7    15.6    17.4
    194              15.0    16.9    16.7    15.7    17.6
    254              16.4    16.9    16.5    15.6    17.6
    312              17.3    16.7    16.3    15.4    17.5
    331              17.7    16.3    16.0    15.1    17.2
    392              18.5    16.3    16.2    15.2    17.3
    446              19.5    16.6    16.6    15.6    17.2
    519              19.6    16.4    16.7    15.6    17.3
    579              19.2    16.4    16.7    15.5    17.3
    622              18.7    16.5    16.6    15.5    17.6
    701              18.0    16.6    16.5    15.4    17.6
    741              17.3    ?       16.4    15.4    17.7
    809              16.4    16.7    16.3    15.3    17.4
    882              15.8    17.0    16.6    15.7    17.1
    ?22              15.2    16.8    16.7    15.8    17.0
    970              14.1    16.2    16.5    15.5    17.0
    1026             13.4    15.7    16.5    15.4    16.9
    1047             12.3    15.4    16.5    15.4    16.9

    Average          15.7    16.3    16.4    15.4    17.2
    SQDEV            7.6     0.3     0.1     0.1     0.2
    Std. dev.        2.8     0.6     0.4     0.4     0.5
    Max dev.         9.2     2.4     1.6     1.7     2.0

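The summary rows of the spacing table can be approximately reproduced from a column of values. The
sketch below uses the 3-peak column and rests on two assumptions that are not stated in the
source: that "SQDEV" is the mean squared deviation (population variance) and that "Max dev" is
the difference between the largest and smallest spacing. Because the tabulated values are
rounded, the recomputed figures may differ slightly from the printed ones.

```python
from statistics import mean, pstdev, pvariance

# Spacing values for the 3-peak alignment, as listed in the table.
spacing_3pk = [10.4, 11.1, 12.4, 13.0, 13.7, 15.0, 16.4, 17.3, 17.7, 18.5, 19.5,
               19.6, 19.2, 18.7, 18.0, 17.3, 16.4, 15.8, 15.2, 14.1, 13.4, 12.3]


def summarize(values):
    """Average, squared deviation, standard deviation and range of a column."""
    return {
        "average": mean(values),
        "sqdev": pvariance(values),              # assumed meaning of "SQDEV"
        "std_dev": pstdev(values),
        "max_dev": max(values) - min(values),    # assumed meaning of "Max dev"
    }


summary = summarize(spacing_3pk)
for name, value in summary.items():
    print(name, round(value, 2))
```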
The relationship between accuracy and read length for each of these two experiments is shown in
Figs. 4 and 5, respectively. As shown in Fig. 4, for a given accuracy (for example 97%), data
alignment based on information from a reference channel allows an increase in read length of at
least 10%, i.e., another 100 bases can be accurately read. Alternatively, for a given read length
(for example 900 bases), it provides improved accuracy (98.5%, up from 97%). These conclusions
are based on results of base calling for lanes that were relatively well-aligned to begin with.
For channels which experience a large shift in the raw data, the effect of alignment in
accordance with the invention is more pronounced (Fig. 5). Thus, in this experimental system
without alignment it is possible to call only 100 bases with reasonable accuracy. After
alignment, however, up to 1000 bases can be called.